A brief overview of runtime analysis, data structures, and hashing

Common terms in computer science and engineering:

  • "List"
  • Array, Linked List
  • Worst case runtimes
  • O(n) (linear)
  • Hashing
  • Hash table
  • P and NP

Related terms in Python:

  • list
  • dict (dictionary)
  • set

In [ ]:

Motivation: Does some word exist in some book?

Case study, Leaves of Grass by Walt Whitman


In [ ]:
def print_lines(filename, num_lines):
    # Print each word from the first num_lines lines of the file
    count = 0
    with open(filename, 'r') as f:
        for line in f:
            for word in line.split():
                print word
            count += 1
            if count >= num_lines:
                break

In [ ]:
print_lines('leaves-of-grass.txt', 4)

In [ ]:
import re

# Split each line into lowercase word and punctuation tokens
with open('leaves-of-grass.txt','r') as f:
    words = [word.lower() for line in f for word in re.findall(r"[\w']+|[.,!?;]", line)]

In [ ]:
print "number of words =", len(words)
print words[0]
print words[12305]
print words[148185]
print words[148186]

How do we test "contains"?

Python gives us the "in" keyword for lists


In [ ]:
print 'leaves' in words

In [ ]:
print 'leaVes' in words

In [ ]:
print 'captain' in words

In [ ]:
print 'internet' in words

Without 'in', how do we do this?

We have no idea what Python is doing under the hood, so let's write our own version using only indexing, a loop, and comparisons.


In [ ]:
def contains(wordlist, word):
    if len(wordlist) == 0:
        # Wordlist is empty
        return False
    for i in xrange(len(wordlist)):
        element = wordlist[i]
        if word == element:
            return True
    # No match in wordlist
    return False

In [ ]:
print contains(words, 'captain')

In [ ]:
print contains(words, 'internet')

Is this fast? What does fast mean?

Let's say that the most expensive operation for us is a single comparison, word == element

In the best case, we match on the first word.

Let's add a counter to our contains function.


In [ ]:
def contains(wordlist, word):
    if len(wordlist) == 0:
        # Wordlist is empty
        print "Checked 1 word(s)"
        return False
    for i in xrange(len(wordlist)):
        element = wordlist[i]
        if word == element:
            print "Checked", i+1, "word(s)"
            return True
    # No match in wordlist
    print "Checked", i+1, "word(s)"
    return False

In [ ]:
print contains(words, 'leaves')

In [ ]:
print contains(words, 'captain')

In the worst case, we have to check all words.


In [ ]:
print contains(words, 'internet')

What if each comparison took 1 second?


In [ ]:
print "Worst case is %.1f days to query 'words'" % (len(words)/60./60/24)

This is less than a MB of text! What if we wanted to search through every book in the Library of Congress?

Runtime Analysis

Determining average and worst-case runtimes can be complicated, but for a simple for loop it is easy. Since our worst-case runtime is bounded by the number of words ($n$) in the list, increasing the number of words increases the runtime by a linear factor, i.e. for a list of $2n$ words the worst case is $2n$ comparisons.

Other algorithms will scale in different ways, for example $n^2$ or even $2^n$. There are more details, but these worst-case growth rates are referred to as "Big O" notation, so a linear algorithm is $O(n)$.
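
A quick sanity check, using the counting version of contains above: searching for a value that is not in the list forces a full scan, so doubling the list length doubles the worst-case number of comparisons.


In [ ]:
small_list = range(1000)    # n = 1,000 integers
big_list = range(2000)      # 2n = 2,000 integers
contains(small_list, -1)    # worst case: all 1,000 entries checked
contains(big_list, -1)      # worst case: all 2,000 entries checked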

How can we do better?

What is faster than $O(n)$?

If all we can do is access the list by index, we can't do any better with our Python list. We can actually do much better if the list is sorted (see the bonus section at the end), but sorting a list is worst case $O(n \log n)$, which is even slower than a single linear lookup.

Simpler problem: list of ints

Let's say that instead of strings, our "words" are just integers randomly chosen from 0 to 1 million


In [ ]:
import random

In [ ]:
print random.randint(0, 10**6)
print random.randint(0, 10**6)
print random.randint(0, 10**6)
print random.randint(0, 10**6)

In [ ]:
n = len(words)
print n

In [ ]:
words = [random.randint(0, 10**6 - 1) for i in range(n)]

In [ ]:
words[0:10]

We can still use our contains function, because Python's == comparison works on integers just as well as strings.


In [ ]:
contains(words, 1)

In [ ]:
contains(words, 9999)

In [ ]:
contains(words, 100000)

In [ ]:
contains(words, 413351)

In [ ]:

Use case: fast queries

We can afford to process the data for 1 day, but queries should run in 1 second. How can we do this for arbitrary lists?

Since we know that each integer is in the range [0, 1,000,000), we can make a custom data structure to represent our unsorted set of words. Its form is a list where the index is the integer in question and the value is a boolean (True if the integer is in the set, False if it is not).

This type of data structure is sometimes called a bitmap: it maps each integer to a single "bit" (yes or no).


In [ ]:
# Start with the empty set of ints from 0 to 999,999
words_map = [False for i in range(1000000)]

In [ ]:
words_map[0:10]

If we want to say the integer 2 is in our word set, we set words_map[2] = True


In [ ]:
words_map[2] = True

In [ ]:
words_map[0:10]

Starting from our empty set, let's load our words list in


In [ ]:
words_map = [False for i in range(1000000)]
for i in words:
    words_map[i] = True

In [ ]:


In [ ]:
contains(words, 219943)

In [ ]:
words_map[219943]

We did that lookup by checking only one entry!

Now we can write a contains function that does the same thing!


In [ ]:
def contains_map(word_bitmap, word):
    # Caution, only works on integers up to len(word_bitmap)!
    print "Checked 1 word(s)"
    return word_bitmap[word]

In [ ]:
contains(words, 219943)

In [ ]:
contains_map(words_map, 219943)

In [ ]:
contains(words, 99149)

In [ ]:
contains_map(words_map, 129343)

In [ ]:

Summary

We've figured out how to solve the query problem for a set of integer "words", but only if they are drawn from a fixed range of ints, e.g. 0 to 999,999.

How might we apply this to the book problem?

  • Maybe we only care about words in the English language
    • ~200,000 possible words, can make bitmap of size 200,000
    • Have some dictionary code for each word, e.g. aardvark = 3, zebra = 199,847 (see the sketch below)
    • If you searched for a word outside the vocab, answer would be "I don't know"
    • New words are entering the language constantly!
  • We could make the bitmap big enough to store every possible word formed with lower case letters
    • There are 26 lowercase letters, so there are $26^k$ possible strings of length $k$

In [ ]:
for i in range(2,10):
    print "Word of length %d has %d possible combinations" % (i, 26**i)

The modulus

We'll use the remainder after division by some number k to map an unbounded space of values down to an index into a list whose length we choose.


In [ ]:
4/2

In [ ]:
4%2

In [ ]:
5/2

In [ ]:
5%2

In [ ]:
2000/1000

In [ ]:
2999/1000

In [ ]:
2999%1000

In [ ]:
3000/1000

If we take $N \bmod k$ (N % k in Python), the result will always be between $0$ and $k - 1$.


In [ ]:
k = 100000

In [ ]:
words_map = [False for i in range(k)]

In [ ]:
words_map

In [ ]:
len(words_map)

Populate our words_map


In [ ]:
for i in words:
    words_map[i%k] = True

In [ ]:
words_map

In [ ]:


In [ ]:
words_map[5]

In [ ]:
5 % k

In [ ]:
100005 % k

This is called a collision. In a basic hashing bitmap, we can only say with certainty whether a number is NOT in the map.


In [ ]:
def contains_modmap(word_modmap, k, word):
    print "Checked 1 word(s)"
    if word_modmap[word % k]:
        return "Maybe"
    else:
        return False

In [ ]:
contains_modmap(words_map, 100000, 219943)
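
For example: 319943 has the same remainder mod 100,000 as 219943, so it lands in the same slot. Assuming 219943 really was one of our random integers (as in the query above), the map can only answer "Maybe" for 319943 even if it was never inserted.


In [ ]:
# 319943 collides with 219943: same remainder, so the same slot in the map
print 319943 % 100000 == 219943 % 100000
contains_modmap(words_map, 100000, 319943)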

In [ ]:


In [ ]:


In [ ]:

How can we get yes/no?

Instead of a bitmap to True/False, let's store the data directly! One list per entry in the map.


In [ ]:
k = 100000

In [ ]:
words_map = [list() for i in range(k)]

In [ ]:
words_map

How do we add a number to words_map?

  • Use mod to determine the index
  • Add the number itself to the list at the index

In [ ]:
number = 3

In [ ]:
word_list = words_map[3 % k]

In [ ]:
word_list.append(3)

In [ ]:


In [ ]:
words_map

In [ ]:


In [ ]:
number = 100003

In [ ]:
word_list = words_map[number % k]
word_list.append(number)

In [ ]:
words_map

In [ ]:

Query: is a number in words_map?


In [ ]:
def contains_modmap(word_modmap, k, word):
    # Look only in the one bucket this number maps to
    word_list = word_modmap[word % k]
    print contains(word_list, word)

In [ ]:
contains_modmap(words_map, 100000, 100003)

In [ ]:
words_map[3 % 100000]

In [ ]:

We just made a hash table!

A "Hash function" is any function that converts some object of arbitrary size to number of fixed size.

Our hash function is just modulus!


In [ ]:
def our_hash(n):
    k = 100000
    return n % k

We set k equal to the length of our array. We can use an array of any size by changing k.
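
For instance (a small sketch; our_hash_k is just a made-up name here), the same idea with the table size as a parameter:


In [ ]:
def our_hash_k(n, k):
    # Same idea, but the table size k is a parameter
    return n % k

print our_hash_k(219943, 100000)
print our_hash_k(219943, 1000)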


In [ ]:

More about hash functions

There are many different hash functions. A good hash function will evenly distribute hashed values onto the fixed space.

Python has a built-in hash function for exactly this purpose!


In [ ]:
hash(0)

In [ ]:
hash(1)

In [ ]:
hash("Ryan")

In [ ]:
hash("Ryan") % 10

In [ ]:

Back to strings

Armed with Python's built-in hash and the modulus, we can finish our hash table for words.


In [ ]:
# Load the book in again

In [ ]:
with open('leaves-of-grass.txt','r') as f:
    words = [word.lower() for line in f for word in re.findall(r"[\w']+|[.,!?;]", line)]

In [ ]:
k = 100000
words_hashtable = [list() for i in range(k)]

# Put each word into the bucket selected by its hash
for word in words:
    word_list = words_hashtable[hash(word) % k]
    word_list.append(word)

In [ ]:
words_hashtable

In [ ]:
def contains_hashtable(words_hashtable, k, word):
    # Only the one bucket this word hashes to needs to be searched
    word_list = words_hashtable[hash(word) % k]
    print contains(word_list, word)

In [ ]:
print contains(words, 'captain')

In [ ]:
contains_hashtable(words_hashtable, 100000, 'lincoln')

In [ ]:


In [ ]:

Hash table Load Factor

$\lambda = \frac{N_{elements}}{M_{capacity}}$

Typical values are around 0.5 to 0.75 to give good performance.
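
As a rough sketch for the word table built above (counting every inserted word, duplicates included):


In [ ]:
n_elements = len(words)               # everything we inserted
m_capacity = len(words_hashtable)     # number of buckets, k = 100,000
print "load factor = %.2f" % (float(n_elements) / m_capacity)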


In [ ]:

The end!


In [ ]:


In [ ]:


In [ ]:

Bonus: Sorted lists


In [ ]:
print "leaves" < "zebra"

In [ ]:
print "leaves" < "abc"

In [ ]:
words.sort()

In [ ]:
print words[0]

In [ ]:
print words[30000]

In [ ]:
print words[90000]

In [ ]:
print words[148186]
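
Once the list is sorted, membership can be checked with binary search in $O(\log n)$ comparisons. A minimal sketch using Python's built-in bisect module (contains_sorted is a name made up here):


In [ ]:
import bisect

def contains_sorted(sorted_words, word):
    # bisect_left returns the index where word would be inserted to keep order
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print contains_sorted(words, 'captain')
print contains_sorted(words, 'internet')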

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:
words.__contains__("captain")   # 'captain' in words calls this list method under the hood